VCF | 1000 Genomes

Are the 1000 genomes variant calls phased?

Answer:

You can tell that a VCF file contains phased genotypes when the delimiter used in the GT field is a pipe symbol (|) rather than a forward slash (/), e.g.

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG00096
10   60523  rs148087467    T     G       100     PASS    AC=0;AF=0.01;AFR_AF=0.06;AMR_AF=0.0028;AN=2; GT:GL 0|0:-0.19,-0.46,-2.28
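As a rough illustration of the check described above, the following Python sketch looks up the GT value in a sample column and tests for the pipe delimiter. The function name `is_phased` is hypothetical and not part of any 1000 Genomes tool.

```python
def is_phased(format_field: str, sample_field: str) -> bool:
    """Return True if the sample's GT value uses the phased delimiter '|'."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # Locate GT by name rather than assuming it is the first key
    gt = values[keys.index("GT")]
    return "|" in gt


# FORMAT and sample columns taken from the example line above
print(is_phased("GT:GL", "0|0:-0.19,-0.46,-2.28"))  # True
```

A genotype written with a forward slash, such as 0/1, would return False.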

The VCF files produced by the final phase of the 1000 Genomes Project (phase 3) are phased. They can be found in the final release directory from the project and in the directory supporting the final publications.

The majority of the VCF files in official releases over the lifetime of the project contained phased variants. This includes the pilot, phase 1 and final phase 3 data sets.

The phase 1 release files contain global R2 values, but you can also use the VCF to PLINK converter if you wish to use our files with Haploview or a similar tool.

Are all the genotype calls in the 1000 Genomes Project current release VCF files bi-allelic?

Answer:

No. While bi-allelic calling was used in earlier phases of the 1000 Genomes Project, multi-allelic SNPs, indels, and a diverse set of structural variants (SVs) were called in the final phase 3 call set. More information can be found in the main phase 3 publication from the 1000 Genomes Project and the structural variation publication. The supplementary information for both papers provides further detail.

In earlier phases of the 1000 Genomes Project, the programs used for genotyping were unable to genotype sites with more than two alleles. In most cases, the highest frequency alternative allele was chosen and genotyped. Depth of coverage, base quality and mapping quality were also used when making this decision. This was the approach used in phase 1 of the 1000 Genomes Project. As methods were developed during the 1000 Genomes Project, it is recommended to use the final phase 3 data in preference to earlier call sets.

Are there any scripts or APIs for use with the 1000 Genomes data sets?

Answer:

There are a number of tools available in the Tools page of the 1000 Genomes Browser.

Our data is in standard formats like SAM and VCF, which have tools associated with them. To manipulate SAM/BAM files, look at SAMtools for a C-based toolkit and links to APIs in other languages. To interact with VCF files, look at VCFtools, which is a set of Perl and C++ code.

We also provide a public MySQL instance with copies of the databases behind the 1000 Genomes Ensembl browsers. These databases are described on our public instance page.

Can I convert VCF files to PLINK/PED format?

Answer:

We provide a VCF to PED tool to convert from VCF to PLINK PED format. This tool has documentation for both the web interface and the Perl script.

An example Perl command to run the script would be:

perl vcf_to_ped_converter.pl -vcf ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.chr13.phase1_integrated_calls.20101123.snps_indels_svs.genotypes.vcf.gz \
    -sample_panel_file ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/phase1_integrated_calls.20101123.ALL.sample_panel \
    -region 13:32889611-32973805 -population GBR -population FIN

Can I get genotypes for a specific individual/population from VCF files?

Answer:

You can use either the Data Slicer or a combination of tabix and VCFtools to subset VCF files for a particular individual or list of individuals.

The Data Slicer, described in more detail in the documentation, has both filter by individual and filter by population options. The individual filter takes the individual names in the VCF header and presents them as a list before giving you the final file. If you wish to filter by population, you must also provide a panel file which pairs individuals with populations. Again, you are presented with a list to select from before being given the final file. Both lists can have multiple elements selected.

To use tabix you must also use a VCFtools Perl script called vcf-subset. The command line would look like:

tabix -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 17:1471000-1472000 | perl vcf-subset -c HG00098 | bgzip -c > /tmp/HG00098.20100804.genotypes.vcf.gz

Can I get haplotype data for the 1000 Genomes individuals?

Answer:

The final data set produced by the 1000 Genomes Project was the phase 3 integrated data set. This contains fully phased haplotypes for 2,504 individuals. Full details can be found in the 1000 Genomes project phase 3 publication.

Can I use the 1000 genomes data for imputation?

Answer:

The developers of Beagle, Mach and Impute2 have all created data sets based on the 1000 Genomes data to use for imputation.

Please look at each tool's website to find those files.

How can I get the allele frequency of my variant?

Answer:

Our VCF files contain global and super population alternative allele frequencies. You can see this in our most recent release. For multi-allelic variants, the frequency of each alternative allele is presented in a comma-separated list.

An example VCF line whose INFO column contains this information looks like:

1 15211 rs78601809 T G 100 PASS AC=3050;AF=0.609026;AN=5008;NS=2504;DP=32245;EAS_AF=0.504;AMR_AF=0.6772;AFR_AF=0.5371;EUR_AF=0.7316;SAS_AF=0.6401;AA=t|||;VT=SNP
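To show how these values can be read programmatically, here is a minimal Python sketch that splits an INFO string into key/value pairs and collects AF and every key ending in _AF, turning each comma-separated list into one number per alternative allele. The helper names `parse_info` and `allele_frequencies` are hypothetical.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO string into a dict; flag keys without '=' map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def allele_frequencies(info: str) -> dict:
    """Return {key: [one float per ALT allele]} for AF and every *_AF key."""
    parsed = parse_info(info)
    return {
        key: [float(x) for x in value.split(",")]
        for key, value in parsed.items()
        if key == "AF" or key.endswith("_AF")
    }


# INFO column from the example line above
info = ("AC=3050;AF=0.609026;AN=5008;NS=2504;DP=32245;EAS_AF=0.504;"
        "AMR_AF=0.6772;AFR_AF=0.5371;EUR_AF=0.7316;SAS_AF=0.6401;AA=t|||;VT=SNP")
freqs = allele_frequencies(info)
print(freqs["AF"], freqs["EUR_AF"])
```

For a multi-allelic site, each list would simply contain more than one value, in the same order as the ALT alleles.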

If you want population-specific allele frequencies, you have three options:

* For a single variant, you can look at the population genetics page for the variant in our browser. This gives you pie charts and a table for a single site.
* For a genomic region, you can use our allele frequency calculator tool, which gives a set of allele frequencies for selected populations.
* If you would like sub-population allele frequencies for a whole file, it is best to use the vcftools command line tool.

The whole-file approach uses a combination of two vcftools commands, vcf-subset and fill-an-ac.

An example command set using files from our phase 1 release would look like:

grep CEU integrated_call_samples.20101123.ALL.panel | cut -f1 > CEU.samples.list

vcf-subset -c CEU.samples.list ALL.chr13.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz | fill-an-ac |
    bgzip -c > CEU.chr13.phase1.vcf.gz

Once you have this file you can calculate your frequency by dividing AC (allele count) by AN (allele number).
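The division can be sketched in Python as follows. This assumes the INFO column carries the AC and AN values recalculated by fill-an-ac, and that AC may hold one count per alternative allele. The function name `frequency_from_counts` is hypothetical.

```python
def frequency_from_counts(info: str) -> list:
    """Compute AC/AN for each alternative allele listed in the INFO column."""
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    allele_number = int(fields["AN"])
    # AC is comma-separated when a site has more than one alternative allele
    return [int(count) / allele_number for count in fields["AC"].split(",")]


# AC and AN taken from the phase 3 example INFO column shown earlier
print(frequency_from_counts("AC=3050;AN=5008"))
```

For these values the result is approximately 0.609, which agrees with the AF quoted on that line.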

Please note that some early VCF files from the main project used LD information and other variables to help estimate the allele frequency. This means in these files the AF does not always equal AC/AN. In the phase 1 and phase 3 releases, AC/AN should always match the allele frequency quoted.

How do I get a sub-section of a VCF file?

Answer:

There are two ways to get a subset of a VCF file.

The first is to use the Data Slicer tool from our browser, which has its own documentation. It provides a web interface that asks for the URL of any VCF file and the genomic location you wish to get a sub-slice for. It also works for BAM files, and it can filter the file for particular individuals or populations if you also provide a panel file.

The second method is to use tabix on the command line, for example:

tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 2:39967768-39967768

Specifications for the VCF format, and a C++ and Perl tool set for VCF files, can be found at VCFtools on SourceForge.

Please note that all our VCF files use plain integers and X/Y for their chromosome names, in the Ensembl style, rather than chr1 in the UCSC style. If you request a subsection of a VCF file using a chromosome name in the style chrN, as shown below, it will not work.

tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz chr2:39967768-39967768
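If your region strings come from a UCSC-style source, one option is to strip the prefix before passing them to tabix. The Python sketch below shows the idea. The function name `to_ensembl_region` is hypothetical, and the sketch does not handle other naming differences, such as the mitochondrial chromosome.

```python
def to_ensembl_region(region: str) -> str:
    """Strip a leading UCSC-style 'chr' prefix, e.g. 'chr2:1-10' -> '2:1-10'."""
    if region.lower().startswith("chr"):
        return region[3:]
    return region


print(to_ensembl_region("chr2:39967768-39967768"))  # 2:39967768-39967768
```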

What is the Data Slicer?

Answer:

The Data Slicer is a web based tool in our browser which allows you to get subsections of our indexed VCF and BAM files.

What is the depth of coverage of your Phase1 variants?

Answer:

The Phase 1 integrated variant set does not report the depth of coverage for each individual at each site. We instead report genotype likelihoods and dosage. If you would like to see depth of coverage numbers you will need to calculate them directly.

The bedtools suite provides a method to do this.

genomeCoverageBed is a tool which can produce a bed file specifying the coverage for every base in the genome, and intersectBed is a tool which provides the intersection between two vcf/bed/bam files.

These commands also require samtools, tabix and vcftools to be installed.

An example set of commands would be:

samtools view -b  ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG01375/alignment/HG01375.mapped.ILLUMINA.bwa.CLM.low_coverage.20120522.bam 2:1,000,000-2,000,000 | genomeCoverageBed -ibam stdin -bg > coverage.bg

This command gives you a bedgraph file of the coverage of the HG01375 bam between 2:1,000,000-2,000,000.

tabix -h http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/ALL.chr2.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz 2:1,000,000-2,000,000 | vcf-subset -c HG01375 | bgzip -c > HG01375.vcf.gz

This command gives you the vcf file for 2:1,000,000-2,000,000 with just the genotypes for HG01375.

To get the coverage for all those sites you would use:

intersectBed -a HG01375.vcf.gz -b coverage.bg -wb > depth_numbers.vcf

For more information about bed file formats, please see the Ensembl File Formats Help.

For more information, you may wish to look at our documentation about data slicing.

What do the names of your variant files mean and what format are the files?

Answer:

Our variant files are distributed in vcf format, a format initially designed for the 1000 Genomes Project which has seen wider community adoption.

The majority of our vcf files are named in the form:

**ALL.chrN|wgs|wex.2of4intersection.20100804.snps|indels|sv.genotypes.analysis_group.vcf.gz**

where | separates the alternative values a field can take.

The name starts with the population in which the variants were discovered; ALL means that all the individuals available at that date were used. Next comes the region covered by the call set, which can be a chromosome, wgs (the file contains at least all the autosomes) or wex (the whole exome). This is followed by a description of how the call set was produced or who produced it, and a date which matches the sequence and alignment freezes used to generate the variant call set. Next is a field which describes the type of variant the file contains, then the analysis group used to generate the variant calls, which should be low coverage, exome or integrated, and finally either sites or genotypes. A sites file contains just the first 8 columns of the vcf format, while a genotypes file contains individual genotype data as well.

Release directories should also contain panel files, which describe which individuals the variants have genotypes for and which populations those individuals are from.

What does the LDAF value mean in your phase1 VCF files?

Answer:

LDAF is an allele frequency value in the info column of our phase 1 VCF files.

Our standard AF values are allele frequencies rounded to 2 decimal places calculated using allele count (AC) and allele number (AN) values. LDAF is the allele frequency as inferred from the haplotype estimation.

You will note that LDAF does sometimes differ from the AF calculated on the basis of allele count and allele number. This generally means there are many uncertain genotypes for this site. This is particularly true close to the ends of the chromosomes.
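One way to find such sites is to compare LDAF with AC/AN directly. The Python sketch below computes the absolute difference for a single INFO string. It assumes a bi-allelic site, as in phase 1, with AC, AN and LDAF all present. The function name `ldaf_discrepancy` and the example values are made up for illustration.

```python
def ldaf_discrepancy(info: str) -> float:
    """Return |LDAF - AC/AN| for a bi-allelic site's INFO column."""
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    af_from_counts = int(fields["AC"]) / int(fields["AN"])
    return abs(float(fields["LDAF"]) - af_from_counts)


# Made-up INFO values, not taken from a real release file
print(ldaf_discrepancy("AC=10;AN=100;LDAF=0.15"))
```

Sites where this difference is large are candidates for having many uncertain genotypes; the threshold to use is a judgment call for your own analysis.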

What does this variant identifier mean?

Answer:

All of the 1000 Genomes SNPs and indels have been submitted to dbSNP, and will have rsIDs in the main 1000 Genomes release files. The SVs have all been submitted to DGVa and have esvIDs in the main files.

If you are using some of the older working files that were used during the data gathering phase of the 1000 Genomes Project, you may find some variants with other kinds of identifiers, such as Alu_umary_Alu_###. These identifiers were created internally by the groups that did that set of particular variant calling, and are not found anywhere other than these files, as they will have been replaced by official IDs in the later files.

What is a panel file?

Answer:

All our variant call releases since 20100804 have come with a panel file. This file lists all the individuals who are part of the release and the population they come from.

This is a tab delimited file which must have sample and population in its first two columns; some files may then have subsequent columns which describe additional information like which super population a sample comes from or what sequencing platforms have been used to generate sequence data for that sample.

The panel files have names like integrated_call_samples.20101123.ALL.panel or integrated_call_samples_v2.20130502.ALL.panel.

These panel files can be used by our browser tools (the Data Slicer, the Variant Pattern Finder and the VCF to PED converter) to establish population groups for filtering.
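If you want to build the same population groups in your own scripts, the tab-delimited layout described above is straightforward to read. The Python sketch below maps each population to its samples using only the first two columns. The function name `read_panel` is hypothetical, the rows are invented, and skipping a first row that starts with "sample" is an assumption about how a header line, if present, would look.

```python
import csv
import io
from collections import defaultdict


def read_panel(handle) -> dict:
    """Map population -> list of sample names from a tab-delimited panel file."""
    by_population = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank or short lines and a header row, if the file has one
        if len(row) < 2 or row[0] == "sample":
            continue
        sample, population = row[0], row[1]
        by_population[population].append(sample)
    return dict(by_population)


# Invented rows for illustration; a real file would be opened with open(path)
example = io.StringIO("sampleA\tPOP1\tSUPER1\nsampleB\tPOP2\tSUPER1\nsampleC\tPOP1\tSUPER1\n")
print(read_panel(example))
```

Any columns after the second are ignored here, so files with extra super population or platform columns are read the same way.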

What strand are the variants in your VCF file on?

Answer:

All the variants in both our VCF files and on the browser are always reported on the forward strand.

What structural variant data is available for the project?

Answer:

The project has two releases of structural variation. The pilot paper data directory contains vcf files for deletions, mobile element insertions, tandem duplications and novel sequence, both for the low coverage and trio pilot studies. Our phase 1 integrated release contains deletions together with the SNPs and short indels.

What version of VCF are your VCF files in?

Answer:

The VCF files on our site cover a variety of format versions, but our most recent release VCF files are in format version 4.1.

Why is the allele frequency different from allele count/allele number?

Answer:

In some early main project releases the allele frequency (AF) was estimated using additional information like LD, mapping quality and haplotype information. This means that in these releases the AF was not always the same as allele count/allele number (AC/AN). In the phase 1 release, AF should always match AC/AN rounded to 2 decimal places.

Why are there duplicate calls in the phase 3 call set?

Answer:

The phase 3 VCF files released in June 2014 contain overlapping and duplicate sites.

This is due to an error in the processing pipeline used when sets of variant calls were combined. Originally, all multi-allelic sites were separated into individual lines in the VCF file during the pipeline, but the recombination process did not always succeed, leaving us with a small number of sites with overlapping or duplicate call records. This is most commonly seen in chromosome X.

The simplest solution is to ignore duplicate sites in any analysis. If you wish to use one or both of a pair of duplicate sites in your own analysis, you should use the GRCh37 alignment files to re-call the genotypes in the individuals you are interested in and so resolve the conflict.
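To find sites to ignore, you could scan the record lines for repeated positions. The Python sketch below reports every (CHROM, POS) pair that appears on more than one line. It only catches records that share a start position, not records that merely overlap. The function name `duplicate_positions` and the example records are made up for illustration.

```python
from collections import Counter


def duplicate_positions(vcf_lines):
    """Return the sorted (CHROM, POS) pairs found on more than one record line."""
    counts = Counter()
    for line in vcf_lines:
        # Skip header lines and blank lines
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos = line.split("\t", 2)[:2]
        counts[(chrom, int(pos))] += 1
    return sorted(site for site, n in counts.items() if n > 1)


# Invented records, not taken from a release file
records = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "X\t100\t.\tA\tG",
    "X\t100\t.\tA\tT",
    "X\t200\t.\tC\tT",
]
print(duplicate_positions(records))  # [('X', 100)]
```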

Why do some of your vcf genotype files have genotypes of ./. in them?

Answer:

Our August 2010 call set represents a merge of various independent call sets. Not all the call sets in the merge had genotypes associated with them. The merge was carried out using a predefined set of rules, which has led to some individuals or whole variant sites having no genotype; this is described as ./. in vcf 4.0. In our November 2010 call set and all subsequent call sets, all sites have genotypes for all individuals for chr1-22 and X.
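If you need to know how many individuals lack a genotype at a site, you can count the missing values per record. The Python sketch below does this for one tab-delimited VCF line, assuming the standard layout with FORMAT in the ninth column and samples after it. The function name `count_missing_genotypes` and the example record are made up for illustration.

```python
def count_missing_genotypes(vcf_line: str) -> int:
    """Count samples on one VCF record line whose GT is missing (./.)."""
    columns = vcf_line.rstrip("\n").split("\t")
    # FORMAT is the 9th column; sample columns follow it
    gt_index = columns[8].split(":").index("GT")
    missing = 0
    for sample in columns[9:]:
        values = sample.split(":")
        if values[gt_index] in ("./.", ".|.", "."):
            missing += 1
    return missing


# Invented record with three samples, two of them without a genotype
line = "1\t100\t.\tA\tG\t100\tPASS\t.\tGT\t0|1\t./.\t./."
print(count_missing_genotypes(line))  # 2
```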
